
Algorithms for Binary Neural Networks

where θ and λ are hyperparameters, M = {M_1, ..., M_N} is the set of M-Filters, and Ĉ is the binarized filter set across all layers. The operation defined in Eq. 3.12 is used to approximate the unbinarized filters from the binarized filters and M-Filters, leading to the filter loss as the first term on the right of Eq. 3.18. The second term on the right is similar to the center loss used to evaluate intra-class compactness, and it deals with the feature variation caused by the binarization process. f_m(Ĉ, M) denotes the feature map of the last convolutional layer for the mth sample, and f(Ĉ, M) denotes the class-specific mean feature map of previous samples. We note that the center loss has been successfully deployed to handle feature variations. After training, we keep only the binarized filters and the shared M-Filters (which are quite small) to calculate the feature maps, which reduces the storage space. We then consider the conventional loss and define a new loss function L_{S,M} = L_S + L_M, where L_S is a conventional loss function, e.g., the softmax loss.
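As a concrete illustration, the loss L_M described above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: binarization via sign(·) and element-wise modulation by each M-Filter are assumptions standing in for the operation of Eq. 3.12, and all function and argument names are hypothetical.

```python
import numpy as np

def modulation_loss(C, M, f, f_mean, theta=1.0, lam=1.0):
    """Sketch of L_M: filter loss plus a center-loss-like compactness term.

    C      -- list of unbinarized filters C_i (all the same shape)
    M      -- list of M-Filters M_j (same shape as each C_i; element-wise
              modulation is an assumption of this sketch)
    f      -- per-sample feature maps f_m of the last convolutional layer
    f_mean -- class-specific mean feature map of previous samples
    """
    C_hat = [np.sign(c) for c in C]  # binarized filters
    # First term: theta/2 * sum_i sum_j ||C_i - C_hat_i * M_j||^2
    filter_loss = 0.0
    for c, c_hat in zip(C, C_hat):
        for m in M:
            filter_loss += np.sum((c - c_hat * m) ** 2)
    # Second term: lam/2 * sum_m ||f_m - f_mean||^2 (intra-class compactness)
    compactness = sum(np.sum((fm - f_mean) ** 2) for fm in f)
    return 0.5 * theta * filter_loss + 0.5 * lam * compactness
```

When the feature maps all equal the class mean, the second term vanishes and only the filter-approximation error remains.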

Again, we consider the quantization process in our loss L_{S,M} and obtain the final minimization objective:

L(C, Ĉ, M) = L_{S,M} + (θ/2) ‖C^[k] − (C^[k] − η δ_C^[k])‖²,    (3.19)

where θ is shared with Eq. 3.18 to reduce the number of parameters, and δ_C^[k] is the gradient of L_{S,M} with respect to C^[k]. Unlike conventional methods (such as XNOR-Net), where only the filter reconstruction is considered in the weight calculation, our discrete optimization method provides a comprehensive way to calculate binarized CNNs by considering the filter loss, the softmax loss, and feature compactness in a unified framework.
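Reading the quantization term of Eq. 3.19 as (θ/2)‖C^[k] − (C^[k] − η δ_C^[k])‖², it simplifies to (θη²/2)‖δ_C^[k]‖², i.e., a penalty on the size of the gradient step taken at iteration k. A minimal sketch of evaluating the objective under that reading (function names are hypothetical):

```python
import numpy as np

def total_objective(loss_sm, C_k, delta_C, theta=1.0, eta=0.1):
    """Eq. 3.19 sketch: L = L_{S,M} + theta/2 * ||C[k] - (C[k] - eta*delta_C[k])||^2.

    loss_sm -- the value of L_{S,M}
    C_k     -- current unbinarized filters C[k]
    delta_C -- gradient of L_{S,M} w.r.t. C[k]
    """
    step = C_k - (C_k - eta * delta_C)  # equals eta * delta_C
    return loss_sm + 0.5 * theta * np.sum(step ** 2)
```

Note that C_k cancels inside the penalty; it is kept in the sketch to mirror the written form of Eq. 3.19.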

3.4.3 Back-Propagation Updating

In MCNs, the unbinarized filters C_i and the M-Filters M must be learned and updated. These two types of filters are learned jointly. In each convolutional layer, MCNs sequentially update the unbinarized filters and the M-Filters.

Updating unbinarized filters: The gradient δ_Ĉ corresponding to C_i is defined as

δ_Ĉ = ∂L/∂Ĉ_i = ∂L_S/∂Ĉ_i + ∂L_M/∂Ĉ_i + θ(C^[k] − (C^[k] − η₁ δ_C^[k])),    (3.20)

C_i ← C_i − η₁ δ_Ĉ,    (3.21)

where L, L_S, and L_M are the loss functions defined above, and η₁ is the learning rate. Furthermore, we have:

∂L_S/∂Ĉ_i = (∂L_S/∂Q) · (∂Q/∂Ĉ_i) = Σ_j (∂L_S/∂Q_ij) ∘ M_j,    (3.22)

∂L_M/∂Ĉ_i = −θ Σ_j (C_i − Ĉ_i ∘ M_j) ∘ M_j.    (3.23)
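The gradient of the filter-loss part of L_M with respect to the binarized filters (Eq. 3.23) can be verified numerically. The sketch below assumes element-wise modulation and hypothetical function names; the finite-difference check confirms the −θ Σ_j (C_i − Ĉ_i ∘ M_j) ∘ M_j form.

```python
import numpy as np

def lm_filter_term(C_i, C_hat_i, M, theta=1.0):
    """Filter-loss part of L_M: theta/2 * sum_j ||C_i - C_hat_i * M_j||^2."""
    return 0.5 * theta * sum(np.sum((C_i - C_hat_i * m) ** 2) for m in M)

def grad_lm_wrt_chat(C_i, C_hat_i, M, theta=1.0):
    """Eq. 3.23 sketch: dL_M/dC_hat_i = -theta * sum_j (C_i - C_hat_i * M_j) * M_j."""
    return -theta * sum((C_i - C_hat_i * m) * m for m in M)

# Finite-difference check of the analytic gradient at one filter entry.
rng = np.random.default_rng(0)
C_i = rng.normal(size=(2, 2))
C_hat = np.sign(C_i)
M = [rng.normal(size=(2, 2)) for _ in range(2)]
g = grad_lm_wrt_chat(C_i, C_hat, M)
eps = 1e-6
E = np.zeros((2, 2)); E[0, 0] = 1.0
numeric = (lm_filter_term(C_i, C_hat + eps * E, M)
           - lm_filter_term(C_i, C_hat - eps * E, M)) / (2 * eps)
```

In practice the sign(·) binarization is non-differentiable, so this gradient is taken with respect to the binarized values themselves and propagated with a straight-through-style estimator.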

Updating M-Filters: We further update the M-Filter M with C fixed. δ_M is defined as the gradient with respect to M, and we have:

δ_M = ∂L/∂M = ∂L_S/∂M + ∂L_M/∂M,    (3.24)

M ← |M − η₂ δ_M|,    (3.25)
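The update of Eq. 3.25 wraps an ordinary gradient step in an absolute value, which keeps the M-Filter entries non-negative. A minimal sketch (function name is hypothetical):

```python
import numpy as np

def update_M(M, delta_M, eta2=0.1):
    """Eq. 3.25 sketch: M <- |M - eta2 * delta_M|.

    The absolute value projects the gradient step back onto
    non-negative values, so M-Filter entries never become negative.
    """
    return np.abs(M - eta2 * delta_M)
```

For example, an entry of 0.05 pushed by a step of 0.1 would overshoot to −0.05 under plain gradient descent; the absolute value folds it back to 0.05.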